
Update: Findings from this notebook were summarized in the Medium blogpost
In the curated COVID-19 datasets there are mainly cumulative descriptions of larger populations, e.g. country-level counts of infected people.
Being valuable, it tells us little about an age/gender group our parents belong to, or the group of our partners, or our siblings' age/gender.
from IPython.core.display import Image, display, HTML
display(Image("../input/helpingillustrations/ntk_pic.png"))
display(HTML("<style>.container { width:100% !important; }</style>"))
drawing credit: Dan Perjovschi
In this notebook I'll focus on person-level gender and age information.
In addition to the Roche UNCOVER Challenge I'm bringing detailed data about contracted cases in Czech republic. Please, see the dataset page for more details.
Firstly, we'll do exploratory data analysis. In the EDA the main goal would be to drill down into the persol-level details as much as possible. The detailed drill down would be illustrated on Czechia due to better data at hand.
Secondly, we'll target the following important question:
Does COVID-19 attack all age-gender groups evenly?
Thirdly, we'll attempt to quantitatively estimate by how much are selected age-gender groups more at risk of contracting COVID-19.
A note before we dive into:
The data are likely to be biased towards people who not only contracted COVID-19, but at the same time had moderate, severe, or critical symptoms. Let's still say "contracting COVID-19 instead of more precise but verbose contracting COVID-19 and having moderate, severe, or critical symptoms
# !pip install chart_studio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
try:
from tqdm.notebook import tqdm
except:
from tqdm import tqdm_notebook as tqdm
import requests
import sys
from itertools import chain
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
import plotly.io as pio
# import chart_studio
# import chart_studio.plotly as py
from scipy.stats import chisquare
from statsmodels.stats.proportion import multinomial_proportions_confint, proportion_confint
from statsmodels.stats.multitest import multipletests
import warnings
%matplotlib inline
export_plotly_figs_online = False
if export_plotly_figs_online:
username = '' # your username
api_key = '' # your api key - go to profile > settings > regenerate key
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
df_czechia = pd.read_csv('../input/covid19-the-czech-republic-personlevel-data/CZ_COVID19_persons_2020_04_13.csv')
df_czech_regions = pd.read_csv('../input/covid19-the-czech-republic-personlevel-data/CZ_region_codes.csv', index_col='region_code_NUTS3')
df_czechia['region_name'] = df_czechia['CZ_region_code_NUTS3'].map(df_czech_regions['region_name'])
df_czechia['gender'] = df_czechia['gender'].map(lambda x: 'Female' if x == 'Z' else 'Male' if x == 'M' else np.nan)
df_czechia['report_date'] = pd.to_datetime(df_czechia['report_date'])
df_czechia['report_date'] = df_czechia['report_date'].map(lambda x: x.date())
df_czechia.head()
df_czechia.isnull().sum()
sum(df_czechia['imported_case'].isnull() | df_czechia['country_of_exposure_csu_code'].isnull())/df_czechia['imported_case'].isnull().sum()
The attributes imported_case and country_of_exposure_csu_code have missing values sumaltenously only. Missing value means local exposure to COVID-19, inside Czechia.
df_czechia[df_czechia['country_of_exposure_csu_code'] == 'CZ']
These four records state exposure to COVID-19 outside The Czech Republic, while listing The Czech Republic as the country of exposure. Let's remove these noisy records, so that all cases of exposure inside CZ have consistent format.
df_czechia.loc[df_czechia['country_of_exposure_csu_code'] == 'CZ', 'imported_case'] = np.nan
df_czechia.loc[df_czechia['country_of_exposure_csu_code'] == 'CZ', 'country_of_exposure_csu_code'] = np.nan
plt.figure(figsize=(14, 7))
bins = df_czechia['age'].quantile(np.arange(0, 1.001, 0.1))
sns.distplot(df_czechia.loc[df_czechia['gender'] == 'Female', 'age'], bins=bins)
sns.distplot(df_czechia.loc[df_czechia['gender'] == 'Male', 'age'], bins=bins)
plt.legend(['Female', 'Male'])
plt.title('Czechia: age distributions of COVID-19 cases per gender', fontsize=21)
We can observe two interesting differences between the distributions of Females vs. Males. Firstly, it seems that Females in their 30's contract COVID-19 less frequently compared to Males in the same age group. The same holds for Females vs. Males in their 70's. Moreover, Women in there 30's have less corona cases than women in their 20's. We'll later check general population proportions and see if these observations can be due to demographics subtleties, sit tight!
plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(figsize=(16, 7))
region_values = df_czechia['region_name'].value_counts(ascending=True)
region_values.plot(kind='barh', color=['dodgerblue'])
for i, v in enumerate(region_values):
ax.text(v + 1, i - 0.2, str(v), color='dodgerblue', fontweight='bold')
plt.title('Czechia: COVID-19 cases per regions', fontsize=21)
Oh, Prague.. It should be noted that based on official stats, overall population of Central Bohemian region is slightly higher than population of Prague. Yet, the population density in Prague is more than 30x higher.
What are age distributions per each region?
To answer this question, let's first create age groups.
df_czechia['age_group'] = pd.cut(df_czechia['age'],
bins = [0, 20] + list(range(30, 91, 10)) + [110],
right=False
)
Now, we'll create heatmap summarizing age distributions per region. Each row would represent one region, and columns would represent age groups. This way the heatmap basically provides info of the 3D histogram (I think of it as looking on the 3D histogram from above)
def plot_summarization_heatmap(gender, title,
bool_index=None,
second_dim_col='report_date',
color='red',
fig_width=800, fig_height=400,
log_scale=False,
upload_fig_filename=None
):
"""
Plots distribution of COVID-19 cases per Age group and Date as a heatmap.
Date is on the X axis, each row corresponds to a particular age group.
@param gender: string specifying gender of the visualization
@param title: plot title
@param bool_index: (optional) pandas Series or numpy array of booleans for boolean indexing of the Czechia dataframe
@param second_dim_col: (optional) column name for the Y axis of the heatmap, expected values are either 'report_date' (default), or 'Country Name', or 'region_name'
@param color: (optional) color for the colorscale, expected values are either 'red' (default), or 'amp'
@param fig_width: (optional) width of the plotly figure
@param fig_height: (optional) height of the plotly figure
@param log_scale: (optional) boolean flag to display color on log-scale
@param upload_fig_filename: (optional) when provided, the figure will be uploaded to your account with filename upload_fig_filename. It is assumed that plotly credentials has been set
"""
# checking inputs
assert color in ['red', 'amp'], "expected values for color are either 'red' (default), or 'amp'"
assert second_dim_col in ['report_date', 'Country Name', 'region_name'], "expected values for second_dim_col are either 'report_date' (default), or 'Country Name', or 'region_name'"
if bool_index is None:
bool_index = np.ones(len(df_czechia), dtype=bool)
vmax = (df_czechia[bool_index]
.groupby(['age_group', second_dim_col, 'gender'])['age']
.count()
.max())
bool_index &= df_czechia['gender']==gender
plt.figure(figsize=(14, 4))
values = (df_czechia[bool_index]
.groupby(['age_group', second_dim_col])['age']
.count()
.unstack()
.fillna(0)
)
values.index = list(map(str, values.index))
if second_dim_col != 'report_date':
values = values.T
x_label = 'Age group [years]'
y_label = f"{'Region' if second_dim_col == 'region_name' else 'Country'} Name"
else:
y_label = 'Age group [years]'
x_label = 'Date'
cmap = px.colors.sequential.Reds if color == 'red' else px.colors.sequential.amp
vmax_vis = round(vmax*4)
if log_scale:
colorscale= [[min(1, count_val/vmax_vis) if count_val > 0 else 0, color]
for count_val, color in zip([0] + list(np.logspace(0, np.log10(vmax_vis), len(cmap) - 1)), cmap)]
else:
colorscale= [[min(1.0, count_val/vmax_vis) if count_val > 0 else 0, color]
for count_val, color in zip(np.linspace(0, vmax_vis, len(cmap)), cmap)]
fig = go.Figure(data=go.Heatmap(z=values.values,
zmin=0,
zmax=vmax,
hovertemplate=str(x_label) + ': %{x}<br>' + str(y_label) + ': %{y}<br>Count: %{z}<extra></extra>',
x=values.columns,
y=values.index,
colorscale=colorscale,
colorbar = dict(title='Count'))
)
fig.update_xaxes(tickangle=45)
fig.update_layout(title=title, height=fig_height, width=fig_width,
xaxis_nticks=len(values.columns), yaxis_nticks=len(values.index),
xaxis=dict(title=x_label), yaxis=dict(title=y_label),
plot_bgcolor='rgba(0, 0, 0, 0)', paper_bgcolor='rgba(0, 0, 0, 0)',
margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2}, )
if not upload_fig_filename is None:
py.plot(fig, filename = upload_fig_filename)
fig.show()
plot_summarization_heatmap('Female', 'Czechia, Females: cases in age groups per regions',
second_dim_col='region_name',
fig_width=600, fig_height=400,
color='amp',
upload_fig_filename='cz_females_age_region' if export_plotly_figs_online else None
)
Interestingly, it seems that Olomouc distribution is shifted towards younger women, while Ústí nad Labem seem to have slight shift towards older ladies.
When comparing official population statistics based on this dataset, average female age doesn't differ significantly for Prague, Olomouc region and Ústí nad Labem region, being 43.3, 43.2,a 44.1 respectively. Olomouc mean is actually higher. However, Olomouc is known to be a "student" city with up to 1/5 of all people there being students from all over the country/world (reference in Czech). So, the shift in Olomouc might be actually due to people studying there, but with official residence in a different region of the Republic.
There are many students in Prague as well, which might be behind higher count of infected women between 20-29 compared to women between 30-39.
Note: as the observations might suggest migration due to studies/work, for the planned statistical tests we'll use age-gender groups as categories, without further detalization into regions.
Analogously, let's check age distributions per region for men:
plot_summarization_heatmap('Male', 'Czechia, Males: cases in age groups per regions',
second_dim_col='region_name',
fig_width=600, fig_height=400,
color='amp',
upload_fig_filename='cz_males_age_region' if export_plotly_figs_online else None
)
We've previously seen that Women in their 30's have less COVID-19 cases compared to women in 20's. Now we can see that the same also holds for Men in several regions!
Computing total counts of COVID-19 positive cases per date in each region. To ensure that each region has a record for each date, we transform long-format data to the wide format and reindex by the date range, then filling missing values forward (so that days without new reported cases also have record), and then stack the dataframe back into the long format.
df_czechia_in_time = (df_czechia
.groupby(['region_name', 'report_date'])['age'].count()
.groupby(level=0)
.cumsum()
)
dates_available = df_czechia_in_time.index.get_level_values(1)
start_date = dates_available.min()
last_date = dates_available.max()
all_dates = pd.date_range(start_date, last_date)
df_czechia_in_time_wide = df_czechia_in_time.unstack(level=0)
df_czechia_in_time_wide = df_czechia_in_time_wide.reindex(all_dates)
# if there were no cases for the first date, fill in zero
df_czechia_in_time_wide.loc[start_date, df_czechia_in_time_wide.loc[start_date].isnull()] = 0
df_czechia_in_time_wide.fillna(method='ffill', inplace=True)
df_czechia_in_time = df_czechia_in_time_wide.stack().reset_index().rename(columns={'level_0': 'report_date', 0: 'Total number of COVID-19 cases'})
Let's validate our data and the aggregation procedure by comparing the overall aggregation we have with the available cumulative data.
open_data = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv')
cz_open_data = (open_data.loc[open_data['Country/Region'] == 'Czechia']
.drop(['Lat', 'Long', 'Country/Region'], axis=1)
.stack()
.droplevel(level=0))
cz_open_data.index = pd.to_datetime(cz_open_data.index)
df_czechia_in_time_cumul = df_czechia_in_time.groupby('report_date')['Total number of COVID-19 cases'].sum()
plt.figure(figsize=(10,5))
ax = plt.axes()
cz_open_data.plot(ax=ax, linewidth=3, marker='o', label='Novel-Corona dataset summary data')
df_czechia_in_time_cumul.plot(ax=ax, linewidth=3, marker='o', label='Aggregation of person-level data')
plt.legend()
plt.title('Total cumulative count of COVID-19 cases', fontsize=21)
We can see that the overall trend is the same and the differences are minor, which provides us the validation we sought. In the public media it was reported that the individual data has been post-processed and cleaned, with multiple tests of the same person being omitted, it can explain the observed minor difference and would suggest that the aggregations based on person-level data might be more precise.
The following plotly visualizations are inspired with this amazing kernel.
Geographical data for plotting would be taken from https://github.com/deldersveld/topojson/blob/master/countries/czech-republic/czech-republic-regions.json.
# getting geo data for plotting
r = requests.get(url='https://raw.githubusercontent.com/deldersveld/topojson/master/countries/czech-republic/czech-republic-regions.json')
topology = r.json()
# Convert topology json into geojson
#The code is from https://gist.github.com/perrygeo/1e767e42e8bc54ad7262
def rel2abs(arc, scale=None, translate=None):
"""Yields absolute coordinate tuples from a delta-encoded arc.
If either the scale or translate parameter evaluate to False, yield the
arc coordinates with no transformation."""
if scale and translate:
a, b = 0, 0
for ax, bx in arc:
a += ax
b += bx
yield scale[0]*a + translate[0], scale[1]*b + translate[1]
else:
for x, y in arc:
yield x, y
def coordinates(arcs, topology_arcs, scale=None, translate=None):
"""Return GeoJSON coordinates for the sequence(s) of arcs.
The arcs parameter may be a sequence of ints, each the index of a
coordinate sequence within topology_arcs
within the entire topology -- describing a line string, a sequence of
such sequences -- describing a polygon, or a sequence of polygon arcs.
The topology_arcs parameter is a list of the shared, absolute or
delta-encoded arcs in the dataset.
The scale and translate parameters are used to convert from delta-encoded
to absolute coordinates. They are 2-tuples and are usually provided by
a TopoJSON dataset.
"""
if isinstance(arcs[0], int):
coords = [
list(
rel2abs(
topology_arcs[arc if arc >= 0 else ~arc],
scale,
translate )
)[::arc >= 0 or -1][i > 0:] \
for i, arc in enumerate(arcs) ]
return list(chain.from_iterable(coords))
elif isinstance(arcs[0], (list, tuple)):
return list(
coordinates(arc, topology_arcs, scale, translate) for arc in arcs)
else:
raise ValueError("Invalid input %s", arcs)
def geometry(obj, topology_arcs, scale=None, translate=None):
"""Converts a topology object to a geometry object.
The topology object is a dict with 'type' and 'arcs' items, such as
{'type': "LineString", 'arcs': [0, 1, 2]}.
See the coordinates() function for a description of the other three
parameters.
"""
return {
"type": obj['type'],
"coordinates": coordinates(
obj['arcs'], topology_arcs, scale, translate )}
from shapely.geometry import asShape
topojson_path = sys.argv[1]
geojson_path = sys.argv[2]
# file can be renamed, the first 'object' is more reliable
layername = list(topology['objects'].keys())[0]
features = topology['objects'][layername]['geometries']
scale = topology['transform']['scale']
trans = topology['transform']['translate']
fc = {'type': "FeatureCollection", 'features': []}
for id, tf in enumerate(features):
f = {'id': id, 'type': "Feature"}
f['properties'] = tf['properties'].copy()
geommap = geometry(tf, topology['arcs'], scale, trans)
geom = asShape(geommap).buffer(0)
assert geom.is_valid
f['geometry'] = geom.__geo_interface__
fc['features'].append(f)
Manually mapping the regions to map ids.
region_name_2_map_id = {'Ústí nad Labem Region': 0,
'South Bohemian Region': 1,
'South Moravian Region': 2,
'Karlovy Vary Region': 3,
'Hradec Králové Region': 4,
'Vysočina Region': 5,
'Liberec Region': 6,
'Moravian-Silesian Region': 7,
'Olomouc Region': 8,
'Pardubice Region': 9,
'Plzeň Region': 10,
'Prague': 11,
'Central Bohemian Region': 12,
'Zlín Region': 13}
df_czechia_in_time['map_id'] = df_czechia_in_time['region_name'].map(region_name_2_map_id)
df_czechia_in_time['Date'] = df_czechia_in_time['report_date'].map(lambda x: x.strftime('%Y-%m-%d'))
fig = px.choropleth(df_czechia_in_time,
geojson=fc,
locations='map_id',
animation_frame='Date',
color_continuous_scale=px.colors.sequential.amp,
hover_name='region_name',
range_color=(0, df_czechia_in_time['Total number of COVID-19 cases'].max()),
color='Total number of COVID-19 cases',
title='Czechia: COVID-19 patients per regions'
)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
pio.write_html(fig, file='cz_covid_regions.html')
On the map above we're able to check precise numbers. Let's also animate log of the total COVID-19 cases to better see the spread in early days.
df_czechia_in_time['Log_2 of total number of COVID-19 cases'] = df_czechia_in_time['Total number of COVID-19 cases'].map(lambda x: np.log2(x) if x != 0 else 0)
fig = px.choropleth(df_czechia_in_time,
geojson=fc,
locations='map_id',
animation_frame='Date',
color_continuous_scale=px.colors.sequential.amp,
hover_data=['region_name', 'Total number of COVID-19 cases'],
range_color=(0, df_czechia_in_time['Log_2 of total number of COVID-19 cases'].max()),
color='Log_2 of total number of COVID-19 cases',
title='Czechia: COVID-19 patients per regions (log scale)'
)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
Which age-gender groups contract COVID-19 more? And when?
Let's check how dynamics of new cases per age groups for women. On x axis we're going to have date, and each row corresponds to an age group.
plot_summarization_heatmap('Female', 'Czechia, Females: daily numbers of new COVID-19 cases',
upload_fig_filename='cz_females_age_date' if export_plotly_figs_online else None)
We can see that women between 40-49 have the most COVID-19 cases.
plot_summarization_heatmap('Male', 'Czechia, Males: daily numbers of new COVID-19 cases',
upload_fig_filename='cz_males_age_date' if export_plotly_figs_online else None)
The detailed visualization enables us to check, for instance, groups of older people. To compare genders, the color scale is the same in the both figures. We can see that in the beginning of epidemic men in their 60's had one of the highest daily counts of new COVID-19 cases among all age groups. We know that the Czech government has implemented additional measures of protecting older people rather early on. And in the data we can see that for people in 60's the daily increase of patients does not grow as it does for younger age groups, even though the beginning looked similar. For a week or so the older people counts actually went down. However, later the counts started to pop, especially for men it got closer to other age groups on the same date. Another troubling observation is the increase in the new daily counts for Females in their 40's. Thankfully, lately the daily increases for this group look better, but still the group remains a leader in new cases daily.
Let's compute and visualize weekly proportions of female COVID-19 patients in each Czech region.
# prepare_female_proportion_in_time_data
df_czechia['report_week'] = df_czechia['report_date'].map(lambda x: x.isocalendar()[1])
week_2_start_date = df_czechia.groupby('report_week')['report_date'].min()
df_czechia['Week Start Date'] = df_czechia['report_week'].map(week_2_start_date).map(str)
df_czechia.drop('report_week', axis=1, inplace=True)
df_czechia_gender_in_time = (df_czechia
.groupby(['region_name', 'Week Start Date'], as_index=False)['gender']
.agg(list)
)
df_czechia_gender_in_time['Proportion of New Female Patients [%]'] = df_czechia_gender_in_time['gender'].map(lambda x: int(100*sum([gender == 'Female' for gender in x])/len(x)))
df_czechia_gender_in_time['Patients count'] = df_czechia_gender_in_time['gender'].map(len)
df_czechia_gender_in_time.sort_values(by=['Week Start Date', 'region_name'], inplace=True)
df_czechia_gender_in_time['map_id'] = df_czechia_gender_in_time['region_name'].map(region_name_2_map_id)
# vizualize_female_proportion_in_time
fig = px.choropleth(df_czechia_gender_in_time,
geojson=fc,
locations='map_id',
animation_frame='Week Start Date',
color_continuous_scale=px.colors.diverging.RdBu_r,
hover_data=['region_name', 'Patients count'],
range_color=(0, 100),
color='Proportion of New Female Patients [%]',
title='Czechia: Proportion of New Female patients per regions'
)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
pio.write_html(fig, file='cz_covid_what_gender_when_where.html')
The visualization enables us to see that counts of Female and Male patients differ, even for larger samples in later weeks, when there're many new COVID-19 patients during later weeks. The week started on the March 30 is especially troubling to me. In all regions there're more Females among new COVID-19-positive patients. Based on our drill-down to age groups, we know that people in 40's contribute to new cases the most. Yet, based on official demographics, there're less Females in this age category than Males. In total, there're slightly more Women (50.8%), but it cannot explain systematically higher proportions of new Female patients.
Let's compute and visualize weekly mean of COVID-19 patients age in each Czech region.
df_czechia_age_in_time = (df_czechia
.groupby(['region_name', 'Week Start Date'], as_index=False)['age']
.agg({'Patients Count': len, 'Average Age of New Patients': np.mean})
)
df_czechia_age_in_time.sort_values(by=['Week Start Date', 'region_name'], inplace=True)
df_czechia_age_in_time['map_id'] = df_czechia_age_in_time['region_name'].map(region_name_2_map_id)
fig = px.choropleth(df_czechia_age_in_time,
geojson=fc,
locations='map_id',
animation_frame='Week Start Date',
color_continuous_scale='blues',
hover_data=['region_name', 'Patients Count'],
range_color=(0, df_czechia_age_in_time['Average Age of New Patients'].max()),
color='Average Age of New Patients',
title='Czechia: Average Age of New Patients per Regions'
)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
pio.write_html(fig, 'cz_how_old_when_where.html')
Let's map the country codes to full names and check overall counts of COVID-19 cases imported into Czechia from each country.
Also, for easier visualization with plotly, let's map 2-letter country codes to 3-leter country codes.
df_country_codes = pd.read_csv('../input/covid19-the-czech-republic-personlevel-data/country_codes', sep=';')
country_code_mapping = df_country_codes.set_index('2let')['3let']
country_code_2_name_mapping = df_country_codes.set_index('2let')['Countrylet']
df_czechia['Country Code'] = df_czechia['country_of_exposure_csu_code'].map(country_code_mapping)
fig, ax = plt.subplots(figsize=(14, 19))
vals = (df_czechia.loc[df_czechia['imported_case']==1, 'Country Code']
.value_counts(ascending=True)
)
vals.plot(kind='barh', color=['dodgerblue'], ax=ax)
plt.title('Czechia: contries of exposure for imported COVID-19 cases', fontsize=21)
for i, v in enumerate(vals):
ax.text(v + 1, i - 0.25, str(v), color='dodgerblue', fontweight='bold')
with warnings.catch_warnings():
warnings.simplefilter('ignore')
df_czechia['imported_case'] = df_czechia['imported_case'].fillna(0)
_, axes = plt.subplots(2, 1, figsize=(12, 10), sharex=True)
propotion_of_imports = (df_czechia
.groupby('report_date')['imported_case']
.mean()
.map(lambda x: int(100*x)))
proportion_of_local_cases = propotion_of_imports.map(lambda x: 100 - x)
axes[1].bar(propotion_of_imports.index, propotion_of_imports, color='red', label='Imported cases')
axes[1].bar(propotion_of_imports.index, proportion_of_local_cases, color='brown', bottom=propotion_of_imports, label='Local cases')
axes[1].set_ylabel('Proportion out of all cases [%]')
axes[1].legend()
(df_czechia
.groupby('report_date')['imported_case']
.sum()).plot(ax=axes[0], linewidth=3, marker='o')
axes[0].set_ylabel('Number of imported COVID-19 cases')
axes[1].set_xlabel('Date')
plt.xticks(rotation=30)
plt.suptitle('Dynamics of Imported COVID-19 cases', fontsize=24)
Computing and visualizing cumulative counts for each date.
df_czechia_imports_in_time = (df_czechia[df_czechia['imported_case']==1]
.groupby(['Country Code', 'report_date'])['age'].count()
.groupby(level=0)
.cumsum())
df_czechia_imports_in_time_wide = df_czechia_imports_in_time.unstack(level=0)
df_czechia_imports_in_time_wide = df_czechia_imports_in_time_wide.reindex(all_dates)
df_czechia_imports_in_time_wide.fillna(method='ffill', inplace=True)
df_czechia_imports_in_time = df_czechia_imports_in_time_wide.stack().reset_index().rename(columns={'level_0': 'report_date', 0: 'Total imported COVID-19 cases'})
df_czechia_imports_in_time['Date'] = df_czechia_imports_in_time['report_date'].map(lambda x: x.strftime('%Y-%m-%d'))
fig = px.choropleth(df_czechia_imports_in_time,
locations='Country Code',
animation_frame='Date',
hover_name='Country Code',
range_color=(0, df_czechia_imports_in_time['Total imported COVID-19 cases'].max()),
color='Total imported COVID-19 cases',
title='Czechia: Cumulative Count of COVID-19 Imports per Country of Exposure'
)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_layout(height=600, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
pio.write_html(fig, file='cz_covid_where_from_when_cumulative.html')
Computing weekly counts of new imported COVID-19 cases per exposure contry.
df_czechia_weekly_imports_in_time = (df_czechia[df_czechia['imported_case']==1]
.groupby(['Country Code', 'Week Start Date'], as_index=False)['age']
.count()
.rename(columns={'age':'New imported COVID-19 cases'})
)
df_czechia_weekly_imports_in_time.sort_values(by=['Week Start Date', 'Country Code'], inplace=True)
fig = px.choropleth(df_czechia_weekly_imports_in_time,
locations='Country Code',
animation_frame='Week Start Date',
hover_name='Country Code',
range_color=(0, df_czechia_weekly_imports_in_time['New imported COVID-19 cases'].max()),
color='New imported COVID-19 cases',
title='Czechia: Weekly Imported COVID-19 cases per Country of Exposure'
)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_layout(height=600, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
pio.write_html(fig, file='cz_covid_where_from_when_weekly.html')
df_czechia_weekly_imports_gender = (df_czechia[df_czechia['imported_case']==1]
.groupby('Week Start Date')['gender']
.agg(list))
females_proportions = df_czechia_weekly_imports_gender.map(lambda x: int(100*sum([gender == 'Female' for gender in x])/len(x)))
males_proportions = df_czechia_weekly_imports_gender.map(lambda x: int(100*sum([gender == 'Male' for gender in x])/len(x)))
plt.figure(figsize=(12, 7))
xticks = list(range(len(females_proportions)))
plt.bar(xticks, males_proportions, color='tab:blue', edgecolor='white', label='Proportion of New Male Patients [%]')
plt.bar(xticks, females_proportions, bottom=males_proportions,
color='tab:pink', edgecolor='white', label='Proportion of New Female Patients [%]')
plt.plot([-.3] + xticks + [len(females_proportions)-.7], 50*np.ones(len(females_proportions) + 2),
linestyle='--', linewidth=3, c='red', label='50% line')
plt.xticks(xticks, females_proportions.index, rotation=30)
plt.xlabel('Week Start Date', fontsize=18)
plt.ylabel('[%]')
plt.legend(loc='upper left', bbox_to_anchor=(1,1))
plt.title('Czechia, Weekly gender proportions among patients importing COVID-19', fontsize=21)
plt.figure(figsize=(12,7))
ax = sns.countplot(x='Week Start Date', hue='gender', data=df_czechia[df_czechia['imported_case']==1])
plt.xticks(rotation=30)
plt.legend(title='Gender')
plt.ylabel('Weekly Count', fontsize=18)
plt.title('Czechia, new imported cases per gender')
plt.show()
df_czechia_weekly_imports_gender_in_time = (df_czechia[df_czechia['imported_case']==1]
.groupby(['Country Code', 'Week Start Date'], as_index=False)['gender']
.agg(list)
)
df_czechia_weekly_imports_gender_in_time['Proportion of New Female Patients [%]'] = df_czechia_weekly_imports_gender_in_time['gender'].map(lambda x: int(100*sum([gender == 'Female' for gender in x])/len(x)))
df_czechia_weekly_imports_gender_in_time['Patients Count'] = df_czechia_weekly_imports_gender_in_time['gender'].map(len)
df_czechia_weekly_imports_gender_in_time.sort_values(by=['Week Start Date', 'Country Code'], inplace=True)
\
fig = px.choropleth(df_czechia_weekly_imports_gender_in_time,
locations='Country Code',
animation_frame='Week Start Date',
color_continuous_scale=px.colors.diverging.RdBu_r,
hover_data=['Country Code', 'Patients Count'],
range_color=(0, 100),
color='Proportion of New Female Patients [%]',
title='Czechia: Proportion of Females among Patients Importing COVID-19, per Country of Exposure'
)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_layout(height=600, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
pio.write_html(fig, file='cz_covid_where_from_what_gender_when_weekly.html')
df_czechia_italy_in_time = (df_czechia[df_czechia['Country Code'] == 'ITA']
.groupby(['region_name', 'Week Start Date'], as_index=False)['gender']
.agg({'Count': 'count', 'Percentage of Females':lambda x: int(100*sum([gender == 'Female' for gender in x])/len(x))})
)
df_czechia_italy_in_time.sort_values(by=['Week Start Date', 'region_name'], inplace=True)
df_czechia_italy_in_time['map_id'] = df_czechia_italy_in_time['region_name'].map(region_name_2_map_id)
fig = px.choropleth(df_czechia_italy_in_time,
geojson=fc,
locations='map_id',
animation_frame='Week Start Date',
color_continuous_scale=px.colors.sequential.amp,
hover_data=['region_name', 'Percentage of Females'],
range_color=(0, df_czechia_italy_in_time['Count'].max()),
color='Count',
title='COVID-19 imports from Italy: into which regions?'
)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
df_czechia_aut_in_time = (df_czechia[df_czechia['Country Code'] == 'AUT']
.groupby(['region_name', 'Week Start Date'], as_index=False)['gender']
.agg({'Count': 'count', 'Percentage of Females':lambda x: int(100*sum([gender == 'Female' for gender in x])/len(x))})
)
df_czechia_aut_in_time.sort_values(by=['Week Start Date', 'region_name'], inplace=True)
df_czechia_aut_in_time['map_id'] = df_czechia_aut_in_time['region_name'].map(region_name_2_map_id)
fig = px.choropleth(df_czechia_aut_in_time,
geojson=fc,
locations='map_id',
animation_frame='Week Start Date',
color_continuous_scale=px.colors.sequential.amp,
hover_data=['region_name', 'Percentage of Females'],
range_color=(0, df_czechia_aut_in_time['Count'].max()),
color='Count',
title='COVID-19 imports from Austria: into which regions?'
)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
def plot_imported_cases_per_age(gender):
"""
The function computes and plots number of imported cases per each age group of the requested gender
@param gender: a string specifying gender (valid values are 'Female'/'Male')
"""
df_czechia['age_group_'] = pd.cut(df_czechia['age'],
bins = [0, 12] + list(range(17, 81, 5)) + [110],
right=False
)
plt.figure(figsize=(9,5))
(df_czechia[df_czechia['gender'] == gender]
.groupby('age_group_')['imported_case']
.sum()
).plot(kind='bar')
plt.xlabel('Age group [years]')
plt.xticks(rotation=50)
plt.ylabel('Number of imported cases')
plt.ylim((0, df_czechia
.groupby(['gender', 'age_group_'])['imported_case']
.sum()
.max()*1.1))
df_czechia.drop('age_group_', axis=1, inplace=True)
plt.title(f'Czechia, {gender}s: Number of imported cases per age', fontsize=18)
plot_imported_cases_per_age('Female')
plot_imported_cases_per_age('Male')
df_czechia_weekly_imports_age_in_time = (df_czechia
.groupby(['country_of_exposure_csu_code', 'Week Start Date'], as_index=False)['age']
.agg({'Patients Count': len, 'Average Age of New imported COVID-19 cases': np.mean})
)
df_czechia_weekly_imports_age_in_time.sort_values(by=['Week Start Date', 'country_of_exposure_csu_code'], inplace=True)
df_czechia_weekly_imports_age_in_time['Country Code'] = df_czechia_weekly_imports_age_in_time['country_of_exposure_csu_code'].map(country_code_mapping)
fig = px.choropleth(df_czechia_weekly_imports_age_in_time,
locations='Country Code',
animation_frame='Week Start Date',
color_continuous_scale='blues',
hover_data=['Country Code', 'Patients Count'],
range_color=(0, df_czechia_weekly_imports_age_in_time['Average Age of New imported COVID-19 cases'].max()),
color='Average Age of New imported COVID-19 cases',
title='Czechia: Average Age of Patients Importing COVID-19, per Country of Exposure'
)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_layout(height=600, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
pio.write_html(fig, file='cz_covid_where_from_how_old_when_weekly.html')
fig.show()
df_czechia['Country Name'] = df_czechia['country_of_exposure_csu_code'].map(country_code_2_name_mapping)
plot_summarization_heatmap('Female', 'Czech Females: Counts of Imported Cases Per Country',
bool_index=df_czechia['imported_case']==1, second_dim_col='Country Name',
fig_width=500, fig_height=700,
upload_fig_filename='cz_imported_females_age_country' if export_plotly_figs_online else None
)
plot_summarization_heatmap('Male', 'Czech Males: Counts of Imported Cases Per Country',
bool_index=df_czechia['imported_case']==1, second_dim_col='Country Name',
fig_width=500, fig_height=700,
upload_fig_filename='cz_imported_males_age_country' if export_plotly_figs_online else None
)
plot_summarization_heatmap('Female', 'Czechia, Females: Daily Counts of New Imported Cases',
bool_index=df_czechia['imported_case']==1,
upload_fig_filename='cz_imported_females_age_date' if export_plotly_figs_online else None)
plot_summarization_heatmap('Male', 'Czechia, Males: Daily Counts of New Imported Cases',
bool_index=df_czechia['imported_case']==1,
upload_fig_filename='cz_imported_males_age_date' if export_plotly_figs_online else None)
Interesting to see, that young men were importing COVID-19 as a leading age groupy mainly in the beginning, while young women are consistently importing COVID-19 as one of the most active female age groups.
Before moving on to Canada data, let's compute COVID-19 age-gender group proportions and reorganize data into long format, so that it'd be more straightforward to include general population proportions later on.
counts_1 = df_czechia.loc[df_czechia['gender'] == 'Female', 'age_group'].value_counts()
counts_2 = df_czechia.loc[df_czechia['gender'] == 'Male', 'age_group'].value_counts()
sample_size_cz = np.sum(counts_1) + np.sum(counts_2)
counts_1 /= 0.01*sample_size_cz
counts_2 /= 0.01*sample_size_cz
y_max = 1.1 * max(max(counts_1), max(counts_2))
all_counts = pd.DataFrame({'Female': counts_1, 'Male': counts_2})
df_proportions_czechia = (all_counts
.stack()
.reset_index()
.rename(columns={'level_0':'age_group', 'level_1':'gender', 0: 'Sample proportion [%]'}))
df_proportions_czechia['age_group'] = df_proportions_czechia['age_group'].astype(str)
#Let's start with the `covid_19_canada_open_data_working_group` dataset.
df_canada_open = pd.read_csv('../input/uncover/UNCOVER/covid_19_canada_open_data_working_group/public-covid-19-cases-canada.csv')
df_canada_open.shape
df_canada_open['age'].value_counts().plot(kind='barh', color=['dodgerblue'])
plt.title('Available Age Info in the Canada Open Data dataset')
# df_canada_tracker = pd.read_csv('../input/uncover/covid_tracker_canada/covid-19-tracker-canada.csv')
# print(f'Canada tracker dataset shape: {df_canada_tracker.shape}')
# df_canada_tracker['age'].value_counts().plot(kind='barh')
# plt.title('Available age info in the Canada COVID-19 Tracker Dataset')
Unfortunately, the same hold for covid_tracker_canada dataset: for the majority of records age is missing.
COVID-19 tracker for Canada seems to provide similar info, but without gender info and it has slighty less records.
So, let's use the covid_19_canada_open_data_working_group dataset.
print(f"Count of values in each age category: \n{df_canada_open['age'].value_counts()[1:]}")
Given the counts, let's create a category 0-19, and unify all age records.
df_canada_open.loc[df_canada_open['age'].isin({'<18', '<1', '2', '<10', '<20', '10-19'}), 'age'] = '0-19'
df_canada_open.loc[df_canada_open['age'] == '61', 'age'] = '60-69'
df_canada_open.loc[df_canada_open['age'] == '50', 'age'] = '50-59'
print(f"Count of values in each age category: \n{df_canada_open['age'].value_counts()[1:]}")
df_canada_open.loc[df_canada_open['age'] != 'Not Reported', 'sex'].value_counts().plot(kind='barh', color=['dodgerblue'])
plt.title('Available gender info for known age in the Open Data')
For the majority of cases with reported age, gender is known as well. It is good news as it would enable us to conduct the detailed analysis.
Total number of datapoints with known gender and sex:
sample_size_canada = sum((df_canada_open['age'] != 'Not Reported') & (df_canada_open['sex'] != 'Not Reported'))
print(f'Number of cases with known age and sex: {sample_size_canada}')
Thankfully, we have at least something available to work with.
female_bool_idx = (df_canada_open['age'] != 'Not Reported') & (df_canada_open['sex'] == 'Female')
male_bool_idx = (df_canada_open['age'] != 'Not Reported') & (df_canada_open['sex'] == 'Male')
min(df_canada_open.loc[female_bool_idx, 'age'].value_counts().min(),
df_canada_open.loc[male_bool_idx, 'age'].value_counts().min())
female_age_group_sizes = df_canada_open.loc[female_bool_idx, 'age'].value_counts(dropna=False)
print(f"Number of persons in small age categories for women: \n{female_age_group_sizes[female_age_group_sizes<10]}")
male_age_group_sizes = df_canada_open.loc[male_bool_idx, 'age'].value_counts(dropna=False)
print(f"\n\nNumber of persons in small age categories for men: \n{male_age_group_sizes[male_age_group_sizes<10]}")
The age group 90-99 will have to be joined for men and women, to ensure more than 10 representatives for each group.
For other groups we can conduct analysis separately for men and women.
joined_age_groups = ['90-99']
gender_group_list = []
age_group_list = []
group_size_list = []
numeric_age_groups = sorted([group_name for group_name in df_canada_open['age'].unique() if group_name != 'Not Reported'])
for age_group in numeric_age_groups:
female_count = female_age_group_sizes[age_group]
male_count = male_age_group_sizes[age_group]
if age_group in joined_age_groups:
gender_group_list.append('both-genders')
joined_count = female_count + male_count
group_size_list.append(joined_count)
age_group_list.append(age_group)
else:
gender_group_list.extend(['Females', 'Males'])
age_group_list.extend([age_group, age_group])
group_size_list.extend([female_count, male_count])
df_proportions_canada = pd.DataFrame({'gender': gender_group_list, 'age_group': age_group_list,
'Sample proportion [%]': [100*group_size/sample_size_canada for group_size in group_size_list]})
pd.pivot_table(df_proportions_canada, index='age_group', columns=['gender'], values='Sample proportion [%]').plot.bar(rot=90, figsize=(12, 8))
plt.title('Canada: Age distributions of COVID-19 cases per gender', fontsize=18)
plt.ylabel('Proportion [%]')
We're going to formulate a null hypothesis that COVID-19 attacks any person at random, not taking age-gender grouping into account.
We'll see that the null hypothesis will be rejected with a very low type I error. It's formal statistical evidence that there are age-gender population groups with significantly higher risk of contracting COVID-19.
The COVID-19 news can easily bring one down. Yet, to stay effective in this fight agains corona, let's occasionally lift our spirits with some stats jokes
%%HTML
<img src=https://i.pinimg.com/originals/a1/ee/ff/a1eeffe12fd3db681208ad5f0387905c.png width="150" align="left">
I've downloaded the general population data for Czechia and Canada, and put them into public datasets.
def get_czech_population_gender_all(path_to_group_file):
"""
A function for loading The Czech Republic Age Gender Demographics 2018 input file.
@param path_to_group_file: A path to input file. In the dataset each gender info is stored in a separate file
@return: a pandas DataFrame with columns 'Age', 'VALUE'
"""
czech_population_group = pd.read_csv(path_to_group_file, low_memory=False)
czech_population_group.loc[czech_population_group['Age'] == '100+', 'Age'] = 100
czech_population_group['Age'] = czech_population_group['Age'].map(int)
czech_population_group['VALUE'] = czech_population_group['CZ0\r\nČR'].map(lambda x: int(x.replace(',','')))
return czech_population_group[['Age', 'VALUE']]
czech_population_female = get_czech_population_gender_all('../input/the-czech-republic-age-gender-demographics-2018/CZ_demographics_women_2018.csv')
czech_population_male = get_czech_population_gender_all('../input/the-czech-republic-age-gender-demographics-2018/CZ_demographics_men_2018.csv')
Let's compute general population proportions per age groups we've introduced for COVID-19 patients.
def derive_age_category_cz(df, age_start, age_final):
"""
The function computes counts in the general Czech population proportions per age groups found in data on Czech COVID-19 patients.
The computed record is appended to the general population DataFrame.
@param df: The general population pandas DataFrame
@param age_start: lower bound of the new age group interval (included)
@param age_final: upper bound of the new age group interval (excluded)
@return: the general population pandas DataFrame updated with the newly computed category
"""
row_val = 0
i = 0
while age_start + i < age_final:
if np.any(df['Age'] == age_start + i):
row_val += df.loc[df['Age'] == age_start + i, 'VALUE'].values[0]
i += 1
row_df = pd.DataFrame({'gender': [df['gender'].values[0]], 'age_group': [f'[{age_start}, {age_final})'], 'VALUE': [row_val]})
return df.append(row_df, ignore_index=True)
czech_age_groups = df_czechia['age_group'].map(str).unique()
czech_population_male['gender'] = 'Male'
czech_population_female['gender'] = 'Female'
for age_group_name in tqdm(czech_age_groups, desc='Deriving age groups..'):
czech_population_female = derive_age_category_cz(czech_population_female, *map(int, age_group_name[1:-1].split(', ')))
czech_population_male = derive_age_category_cz(czech_population_male, *map(int, age_group_name[1:-1].split(', ')))
czech_population_joined = pd.concat((czech_population_female, czech_population_male), sort=False)
the_whole_population_size = czech_population_joined.loc[czech_population_joined['Age'] == -1, 'VALUE'].sum()
df_proportions_czechia['True population proportion [%]'] = np.nan
for index, row in df_proportions_czechia.iterrows():
true_population_count = czech_population_joined.loc[(czech_population_joined['age_group'] == row['age_group']) &
(czech_population_joined['gender'] == row['gender']) , 'VALUE'].values[0]
df_proportions_czechia.loc[index, 'True population proportion [%]'] = 100 * true_population_count / the_whole_population_size
def compare_proportions_visually(df, gender, title='', covid_col='Sample proportion [%]', return_pivoted_table=False):
"""
The function prepare data for visual comparison of age-gender group proportions
between COVID-19 patients and general population.
By default the function creates a comparative plot.
For a custom comparative visualization one can specify return_pivoted_table=True,
in this case data for comparative visualization are returned without plotting default visualization.
@param df: a pandas DataFrame with population and COVID-19 proportions, containing columns ['age_group', 'gender', <covid_col>, 'True population proportion [%]']
@param gender: a gender visual comparison must be prepared for
@param title: a title for default visualization
@param covid_col: a column name with COVID-19 proportion (can 'Sample proportion [%]' or confidence interval lb)
@param return_pivoted_table: a boolean flag to return prepared data for custom visualization
@return: None when return_pivoted_table=False, otherwise a pandas DataFrame containing columns 'COVID-19', and 'General Population', indexed by age groups
"""
sample_proportion = df.loc[df['gender'] == gender,
['age_group', covid_col]].rename(columns={covid_col: 'Proportion'})
sample_proportion['Dataset'] = 'COVID-19'
population_proportion = df.loc[df['gender'] == gender,
['age_group', 'True population proportion [%]']].rename(columns={'True population proportion [%]': 'Proportion'})
population_proportion['Dataset'] = 'General Population'
gender_proportions = pd.concat((sample_proportion, population_proportion), sort=False)
gender_proportions_pivoted_table = pd.pivot_table(gender_proportions,
index='age_group',
columns=['Dataset'],
values='Proportion')
if return_pivoted_table:
return gender_proportions_pivoted_table
gender_proportions_pivoted_table.plot.bar(rot=30, figsize=(12, 7))
plt.xlabel('Age Group [years interval]')
plt.ylabel('Proportion [%]')
plt.title(title, fontsize=18)
compare_proportions_visually(df_proportions_czechia, 'Female',
'Czechia, Females: COVID-19 proportions vs. Population proportions')
compare_proportions_visually(df_proportions_czechia, 'Male',
'Czechia, Males: COVID-19 proportions vs. Population proportions')
An interesting observation
We can see that in Czechia we seemingly manage to protect older women such that there're less of them among COVID-19 patients compared to the population proportion. However, it doesn't hold for older men, even though the state policies are the same (like in the morning only older people are allowed to visit groceries/post office). I believe it's because Czech men are true gentlemen protecting their women whenever possible. Another possible explanation might be that older men are harder to help to compared to older women.
Note the difference in proportion of young people. As this group is an obvious outliner (either due to being more resilient, or due to being more carefully protected, or both), let's also check proportions among population older than 20.
def get_20_plus_proportions(df_proportions, youth_category='[0, 20)'):
"""
A function returns age category proportions in population excluding 'youth_category'
@param df_proportions: a pandas DataFrame with population and COVID-19 proportions, containing columns ['age_group', 'gender', <covid_col>, 'True population proportion [%]']
@param youth_category: a category in the 'age_group' column to be excluded
@return: a pandas DataFrame with proportions recomputed after excluding 'youth_category'
"""
df_proportions_20plus = df_proportions[df_proportions['age_group'] != youth_category].copy()
df_proportions_20plus['Sample proportion [%]'] = 100*df_proportions_20plus['Sample proportion [%]'] / df_proportions_20plus['Sample proportion [%]'].sum()
df_proportions_20plus['True population proportion [%]'] = 100*df_proportions_20plus['True population proportion [%]'] / df_proportions_20plus['True population proportion [%]'].sum()
return df_proportions_20plus
df_proportions_czechia_20plus = get_20_plus_proportions(df_proportions_czechia)
compare_proportions_visually(df_proportions_czechia_20plus, 'Female',
'Czechia, Females over 20: COVID-19 proportions vs. Population proportions')
compare_proportions_visually(df_proportions_czechia_20plus, 'Male',
'Czechia, Males over 20: COVID-19 proportions vs. Population proportions')
Now we're going to test null hypothesis that COVID-19 attacks any person at random, not taking age-gender grouping into account.
A friend of mine recommended me to provide some pointers for those, who'd like to refresh theoretical basics. Here one can find nice lecture notes: from Stanford's "Introduction to Statistical Methods".
And here's a short video from Khan academy:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/2QeDRsxSF9M?rel=0&controls=1&showinfo=0" frameborder="0" allowfullscreen></iframe>
We have verified that each age category in the COVID-19 samples from both Czechia and Canada has at least 10 points, so let's proceed with the testing.
chisquare((df_proportions_czechia['Sample proportion [%]']*sample_size_cz/100).map(int).values,
(df_proportions_czechia['True population proportion [%]']*sample_size_cz/100).map(int).values)
chisquare((df_proportions_czechia.loc[df_proportions_czechia['gender'] == 'Female', 'Sample proportion [%]']*sample_size_cz/100).map(int).values,
(df_proportions_czechia.loc[df_proportions_czechia['gender'] == 'Female', 'True population proportion [%]']*sample_size_cz/100).map(int).values)
chisquare((df_proportions_czechia_20plus.loc[df_proportions_czechia_20plus['gender'] == 'Female', 'Sample proportion [%]']*sample_size_cz/100).map(int).values,
(df_proportions_czechia_20plus.loc[df_proportions_czechia_20plus['gender'] == 'Female', 'True population proportion [%]']*sample_size_cz/100).map(int).values)
chisquare((df_proportions_czechia.loc[df_proportions_czechia['gender'] == 'Male', 'Sample proportion [%]']*sample_size_cz/100).map(int).values,
(df_proportions_czechia.loc[df_proportions_czechia['gender'] == 'Male', 'True population proportion [%]']*sample_size_cz/100).map(int).values)
chisquare((df_proportions_czechia_20plus.loc[df_proportions_czechia_20plus['gender'] == 'Male', 'Sample proportion [%]']*sample_size_cz/100).map(int).values,
(df_proportions_czechia_20plus.loc[df_proportions_czechia_20plus['gender'] == 'Male', 'True population proportion [%]']*sample_size_cz/100).map(int).values)
..Here goes another stats meme
%%HTML
<img src=https://i.pinimg.com/originals/e5/46/4b/e5464b76f64560a133d2827f7bb9eec5.jpg width="310" align="left">
We can see that even when focusing on females/males over 20, p-values are very low.
In our case p-value is probability of getting our result or even more extreme one (more different distributions) when COVID-19 attacks all age groups evenly/at random. Or in other words it's probability of rejecting hypothesis that COVID-19 attacks every age group evenly while in fact it's true (type I error).
Let's now separately test each age/gender group. We've already performed several tests not controlling increase in familywise error rate, considering our investigations preliminary. However, now we're going to perform 18 tests. Let's correct p-values using Holm correction.
df_proportions_czechia['p-value'] = np.nan
for index, row in df_proportions_czechia.iterrows():
covid_group_frequency = df_proportions_czechia.loc[index, 'Sample proportion [%]']*sample_size_cz/100
covid_others_frequency = sample_size_cz - covid_group_frequency
population_group_frequency = df_proportions_czechia.loc[index, 'True population proportion [%]']*sample_size_cz/100
population_others_frequency = sample_size_cz - population_group_frequency
# check normality: at least 10 cases in each category
if min([covid_group_frequency, covid_others_frequency, population_group_frequency, population_others_frequency]) >= 10:
df_proportions_czechia.loc[index, 'p-value'] = chisquare([covid_group_frequency, covid_others_frequency],
[population_group_frequency, population_others_frequency])[1]
df_proportions_czechia['H0_rejected'] = False
df_proportions_czechia.loc[~df_proportions_czechia['p-value'].isnull(), 'H0_rejected'] = multipletests(df_proportions_czechia.loc[~df_proportions_czechia['p-value'].isnull(), 'p-value'])[0]
df_proportions_czechia['Ho_rejected_bonferroni'] = df_proportions_czechia['p-value']*sum(~df_proportions_czechia['p-value'].isnull()) < 0.05
df_proportions_czechia[df_proportions_czechia['H0_rejected']]
Preprocessing the general population data to have it in a format compatable with the COVID-19 data.
canada_population = pd.read_csv('../input/canada-population/17100005.csv', low_memory=False)
canada_population = canada_population.rename(columns={'Sex': 'gender', 'Age group': 'age_group'})
canada_population_female = canada_population.loc[(canada_population['REF_DATE'] == 2019) &
(canada_population['gender'] == 'Females') &
(canada_population['GEO'] == 'Canada'), ['gender', 'age_group', 'VALUE']]
canada_population_male = canada_population.loc[(canada_population['REF_DATE'] == 2019) &
(canada_population['gender'] == 'Males') &
(canada_population['GEO'] == 'Canada'), ['gender', 'age_group', 'VALUE']]
def derive_age_category_canada(df, age_start, age_final):
"""
The function computes counts in the general Canada population proportions per age groups found in data on Canada COVID-19 patients.
The computed record is appended to the general population DataFrame.
@param df: The general population pandas DataFrame
@param age_start: lower bound of the new age group interval (included)
@param age_final: upper bound of the new age group interval (excluded)
@return: the general population pandas DataFrame updated with the newly computed category
"""
row_val = 0
i = 0
while age_start + i*5 < age_final:
row_val += df.loc[df['age_group'] == f'{age_start + i*5} to {age_start + i*5 + 4} years', 'VALUE'].values[0]
i += 1
row_df = pd.DataFrame({'gender': [df['gender'].values[0]], 'age_group': [f'{age_start}-{age_final}'], 'VALUE': [row_val]})
return df.append(row_df, ignore_index=True)
for age_group_name in tqdm(numeric_age_groups, desc='Unifying age groups..'):
canada_population_female = derive_age_category_canada(canada_population_female, *map(int, age_group_name.split('-')))
canada_population_male = derive_age_category_canada(canada_population_male, *map(int, age_group_name.split('-')))
canada_population_joined = pd.concat((canada_population_female, canada_population_male))
the_whole_population_size = canada_population_joined.loc[canada_population_joined['age_group'] == 'All ages', 'VALUE'].sum()
canada_population_joined['True population proportion [%]'] = np.nan
for index, row in df_proportions_canada.iterrows():
if row['gender'] == 'both-genders':
true_population_count = canada_population_joined.loc[canada_population_joined['age_group'] == row['age_group'], 'VALUE'].sum()
else:
true_population_count = canada_population_joined.loc[(canada_population_joined['age_group'] == row['age_group']) &
(canada_population_joined['gender'] == row['gender']) , 'VALUE'].values[0]
df_proportions_canada.loc[index, 'True population proportion [%]'] = 100 * true_population_count / the_whole_population_size
compare_proportions_visually(df_proportions_canada, 'Females',
'Canada, Females: COVID-19 proportions vs. Population proportions')
compare_proportions_visually(df_proportions_canada, 'Males',
'Canada, Males: COVID-19 proportions vs. Population proportions')
df_proportions_canada_20plus = get_20_plus_proportions(df_proportions_canada, youth_category='0-19')
compare_proportions_visually(df_proportions_canada_20plus, 'Females',
'Canada, Females over 20: COVID-19 proportions vs. Population proportions')
compare_proportions_visually(df_proportions_canada_20plus, 'Males',
'Canada, Males over 20: COVID-19 proportions vs. Population proportions')
chisquare((df_proportions_canada.loc[df_proportions_canada['gender'] == 'Females', 'Sample proportion [%]']*sample_size_canada/100).map(int).values,
(df_proportions_canada.loc[df_proportions_canada['gender'] == 'Females', 'True population proportion [%]']*sample_size_canada/100).map(int).values)
chisquare((df_proportions_canada_20plus.loc[df_proportions_canada_20plus['gender'] == 'Females', 'Sample proportion [%]']*sample_size_canada/100).map(int).values,
(df_proportions_canada_20plus.loc[df_proportions_canada_20plus['gender'] == 'Females', 'True population proportion [%]']*sample_size_canada/100).map(int).values)
chisquare((df_proportions_canada.loc[df_proportions_canada['gender'] == 'Males', 'Sample proportion [%]']*sample_size_canada/100).map(int).values,
(df_proportions_canada.loc[df_proportions_canada['gender'] == 'Males', 'True population proportion [%]']*sample_size_canada/100).map(int).values)
chisquare((df_proportions_canada_20plus.loc[df_proportions_canada_20plus['gender'] == 'Males', 'Sample proportion [%]']*sample_size_canada/100).map(int).values,
(df_proportions_canada_20plus.loc[df_proportions_canada_20plus['gender'] == 'Males', 'True population proportion [%]']*sample_size_canada/100).map(int).values)
We obtain the same conclusion for Canada, rejecting the null hypothesis about random age groups targeting by COVID-19 with significance levels under 0.005.
Let's now examine each age group separately.
Let's illustrate how I'm going to do it on one concrete example. My mother belongs to Women between 50-59, I'll use this group for running example. Steps for quantative estimation:
We estimate population proportion of Women between 50-59 based on COVID-19 data. Let's say, 99% confidence interval for the group proportion is 9.1-16.9%. It's an expected population proportion interval if the virus attacks this particular age/gender group at random.
Then we compute actual proportion of women between 50-59 in the population. Let's say, the actual proportion to be 7.0%.
In the running example, the population proportion is less than the COVID-19 proportion by at least 2.1%.
... and here goes a meme related to confidence intervals..
%%HTML
<img src=https://i.imgflip.com/uzpm3.jpg width="210" align="left">
As a reference refresher about the conditions which should be met before estimating confidence intervals for proportions, please, feel free to check this link, or the following video from Khan academy:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/8E8YQY5qE3s?rel=0&controls=1&showinfo=0" frameborder="0" allowfullscreen></iframe>
To sum up, to have valid estimation of population proportion CI, the following must be true:
We needed to ensure points 2. and 3., because non-randomeness must be the only factor which can break our proportion estimates. Then if based on positive COVID-19 cases we obtain an estimate for population proportion of women between 50-59 that is higher than the general population proportion, then obviously the virus picks women between 50-59 more often than at random.
Let's estimate 99% confidence intervals for each age-sex proportion.
Let's compute both simultaneous confidence interval estimations for multinomial proportions, and confidence intervals for binomial proportions, e.g. group female50_59 vs. all other groups.
def estimate_proportions_conf_intervals(df_proportions, sample_size):
"""
A function estimates confidence intervals for population proportion of each age-gender category based on COVID-19 data
Both multinomial and binomial proportion CI's are being computed
@param df_proportions: a pandas DataFrame with population and COVID-19 proportions, containing columns ['age_group', 'gender', <covid_col>, 'True population proportion [%]']
@param sample_size: number of all COVID-19 patients with age-gender records
@return: a pandas DataFrame with proportions updated with confidence interval columns
"""
binomial_population_proportion_lb_list = []
binomial_population_proportion_ub_list = []
age_gender_group_sizes = []
for index, row in df_proportions.iterrows():
age_gender_group_size = row['Sample proportion [%]'] * sample_size/100
age_gender_group_sizes.append(age_gender_group_size)
conf_interval_lb, conf_interval_ub = proportion_confint(age_gender_group_size, sample_size, alpha=0.01)
binomial_population_proportion_lb_list.append(100*conf_interval_lb)
binomial_population_proportion_ub_list.append(100*conf_interval_ub)
conf_intervals = multinomial_proportions_confint(age_gender_group_sizes, alpha=0.01)
population_proportion_lb_list = list(100*conf_intervals[:, 0])
population_proportion_ub_list = list(100*conf_intervals[:, 1])
df_proportions['Population proportion CI LB [%]'] = population_proportion_lb_list
df_proportions['Population proportion CI UB [%]'] = population_proportion_ub_list
df_proportions['Binomial population proportion CI LB [%]'] = binomial_population_proportion_lb_list
df_proportions['Binomial population proportion CI UB [%]'] = binomial_population_proportion_ub_list
return df_proportions
df_proportions_canada = estimate_proportions_conf_intervals(df_proportions_canada, sample_size_canada)
df_proportions_czechia = estimate_proportions_conf_intervals(df_proportions_czechia, sample_size_cz)
The lower the difference between the COVID-19 proportion and the general population proportion, the lower the risk of contracting COVID-19. These cases are depicted with green bars.
On the other hand, the higher the difference, the higher the risk of contracting COVID-19. It's depicted in red.
def plot_increase_in_covid_data(gender_proportions_pivoted_table, title):
"""
Visualization of the proportion increase in COVID-19 data vs. general population data, separately per each age group
@param gender_proportions_pivoted_table: pandas DataFrame with precomputed data for visual comparison between population and COVID-19 proportions,
the DataFrame contains columns 'COVID-19', and 'General Population', indexed by age groups
@param title: title of visualization
"""
gender_proportions_pivoted_table['Increase in COVID-19 proportion'] = gender_proportions_pivoted_table['COVID-19'] - gender_proportions_pivoted_table['General Population']
gender_proportions_pivoted_table['Increase in COVID-19 proportion'] = gender_proportions_pivoted_table['Increase in COVID-19 proportion'].map(lambda x: 0 if x<=0 else x)
plt.figure(figsize=(12, 7))
xticks = list(range(len(gender_proportions_pivoted_table)))
plt.bar(xticks, gender_proportions_pivoted_table['General Population'], color='darkorange', edgecolor='white', label='General Population')
plt.bar(xticks, gender_proportions_pivoted_table['Increase in COVID-19 proportion'], bottom=gender_proportions_pivoted_table['General Population'],
color='darkred', edgecolor='white', label='Increase in COVID-19 proportion\n(HIGHER EXPOSURE TO COVID-19)')
plt.xticks(xticks, gender_proportions_pivoted_table.index)
plt.xlabel('Age Group')
plt.legend(loc='upper left', bbox_to_anchor=(1,1))
plt.title(f'{title}', fontsize=18)
plot_increase_in_covid_data(compare_proportions_visually(df_proportions_czechia, 'Female', return_pivoted_table=True),
title='Czechia, Females: proportion increase in COVID-19 data')
plot_increase_in_covid_data(compare_proportions_visually(df_proportions_canada, 'Females', return_pivoted_table=True),
title='Canada, Females: proportion increase in COVID-19 data')
plot_increase_in_covid_data(compare_proportions_visually(df_proportions_czechia, 'Male',
return_pivoted_table=True),
title='Czechia, Males: proportion increase in COVID-19 data')
plot_increase_in_covid_data(compare_proportions_visually(df_proportions_canada, 'Males',
return_pivoted_table=True),
title='Canada, Males: proportion increase in COVID-19 data')
Here we also visualize in a column cases when COVID-19 proportion is lower than population proportion. Such 'safer' groups have green color.
def display_tabular_diffs_cz(col_name='Sample proportion [%]', col_name_2=None, attribute_name='Czechia, Difference between COVID-19 cases proportion and Population proportion [%]'):
"""
Tabular visualization of the proportion increase/decrease in COVID-19 data vs. general population data using stacked bars
@param col_name: a column name containing sample proportion, or confidence interval lower bound when using confidence intervals for visualization
@param col_name_2: a column name containing confidence interval upper bound when using confidence intervals for visualization, otherwise should be None
@param attribute_name: column name for the COVID-19 vs. population difference
"""
if col_name_2 is None:
df_proportions_czechia[attribute_name] = (df_proportions_czechia[col_name] -
df_proportions_czechia['True population proportion [%]'])
else:
proportion_diffs = df_proportions_czechia['Sample proportion [%]'] - df_proportions_czechia['True population proportion [%]']
df_proportions_czechia[attribute_name] = np.nan
lb_diff = df_proportions_czechia[col_name] - df_proportions_czechia['True population proportion [%]']
df_proportions_czechia.loc[proportion_diffs > 0, attribute_name] = lb_diff[proportion_diffs > 0].map(lambda x: max(0, x))
ub_diff = df_proportions_czechia[col_name_2] - df_proportions_czechia['True population proportion [%]']
df_proportions_czechia.loc[proportion_diffs < 0, attribute_name] = ub_diff[proportion_diffs < 0].map(lambda x: min(0, x))
display((df_proportions_czechia[['gender', 'age_group', attribute_name]]
.sort_values(by=['gender', 'age_group'])
.style.bar(subset=[attribute_name], align='mid', color=['#5fba7d', '#d65f5f'])
.set_properties(subset=[attribute_name], **{'width': '490px'})))
df_proportions_czechia.loc[~df_proportions_czechia['H0_rejected'], attribute_name] = 0
prepated_data = (df_proportions_czechia.loc[(df_proportions_czechia['gender'] == 'Male'), ['gender', 'age_group', attribute_name]]
.sort_values(by=['gender', 'age_group']))
display(prepated_data.style.bar(subset=[attribute_name], align='mid', color=['#5fba7d', '#d65f5f']).set_properties(subset=[attribute_name], **{'width': '520px'}))
df_proportions_czechia.loc[~df_proportions_czechia['H0_rejected'], attribute_name] = 0
prepated_data = (df_proportions_czechia.loc[(df_proportions_czechia['gender'] == 'Female'), ['gender', 'age_group', attribute_name]]
.sort_values(by=['gender', 'age_group']))
display(prepated_data.style.bar(subset=[attribute_name], align='mid', color=['#5fba7d', '#d65f5f']).set_properties(subset=[attribute_name], **{'width': '520px'}))
display_tabular_diffs_cz()
attribute_name = 'Canada, Difference between COVID-19 cases proportion and Population proportion [%]'
df_proportions_canada[attribute_name] = (df_proportions_canada['Sample proportion [%]'] -
df_proportions_canada['True population proportion [%]'])
(df_proportions_canada[['gender', 'age_group', attribute_name]]
.sort_values(by=['gender', 'age_group'])
.style.bar(subset=[attribute_name], align='mid', color=['#5fba7d', '#d65f5f'])
.set_properties(subset=[attribute_name], **{'width': '490px'}))
Based on comparison of COVID-19 95% confidence interval lower level vs. population proportion.
This indicates groups definitely in danger based on data.
display_tabular_diffs_cz(col_name='Population proportion CI LB [%]',
col_name_2='Population proportion CI UB [%]',
attribute_name='Czechia, Difference between COVID-19 confidence interval bounds and Population proportion [%]')
data_input = compare_proportions_visually(df_proportions_czechia, 'Female',
covid_col='Binomial population proportion CI LB [%]',
return_pivoted_table=True)
plot_increase_in_covid_data(data_input,
title='Czechia, Females: conservative COVID-19 proportion increase (Binomial CI)')
data_input = compare_proportions_visually(df_proportions_czechia, 'Female',
covid_col='Population proportion CI LB [%]',
return_pivoted_table=True)
plot_increase_in_covid_data(data_input,
title='Czechia, Females: conservative COVID-19 proportion increase')
data_input = compare_proportions_visually(df_proportions_canada, 'Females', '',
covid_col='Binomial population proportion CI LB [%]',
return_pivoted_table=True)
plot_increase_in_covid_data(data_input,
title='Canada, Females: conservative COVID-19 proportion increase (Binomial CI)')
data_input = compare_proportions_visually(df_proportions_canada, 'Females',
covid_col='Population proportion CI LB [%]',
return_pivoted_table=True)
plot_increase_in_covid_data(data_input,
title='Canada, Females: conservative COVID-19 proportion increase')
data_input = compare_proportions_visually(df_proportions_czechia, 'Male', '',
covid_col='Binomial population proportion CI LB [%]',
return_pivoted_table=True)
plot_increase_in_covid_data(data_input,
title='Czechia, Males: conservative COVID-19 proportion increase (Binomial CI)')
data_input = compare_proportions_visually(df_proportions_czechia, 'Male', '',
covid_col='Population proportion CI LB [%]',
return_pivoted_table=True)
plot_increase_in_covid_data(data_input,
title='Czechia, Males: conservative COVID-19 proportion increase')
data_input = compare_proportions_visually(df_proportions_canada, 'Males', '',
covid_col='Binomial population proportion CI LB [%]',
return_pivoted_table=True)
plot_increase_in_covid_data(data_input,
title='Canada, Males: conservative COVID-19 proportion increase (Binomial CI)')
data_input = compare_proportions_visually(df_proportions_canada, 'Males', '',
covid_col='Population proportion CI LB [%]',
return_pivoted_table=True)
plot_increase_in_covid_data(data_input,
title='Canada, Males: conservative COVID-19 proportion increase')
We can see that multinomial confidence interval estimation, which deals with uncertainty in all categories simultaneously, provide more conservative CI estimation.
Using available data from The Czech Republice, we've demonstrated how to perform in-depth drill down, answering various questions with visualizations.
%%HTML
<img src=https://www.elklan.co.uk/images/blogs/thousand_words.jpg width="140" align="left">
%%HTML
<img src=https://pics.me.me/that-moment-when-having-children-starts-to-pay-off-23399548.png width="140" align="left">
A takeaway message for everyone: the age group with the majority of parents, which have to stay at home with children and therefore social-distant themselves more systematically, shows how crucial social-distancing is for COVID-19 prevention. Everyone can play a part!
Take care and stay healthy, everyone!